This lecture provides an introduction to neural networks and convolutional neural networks. We will see how a neural network borrows ideas from cognitive science and models the behavior of neurons so as to mimic the human perception process. At a high level, a neural network is a “parametric” model with a huge number of unknown parameters. The entire setup is easy to understand with a knowledge of regression. Thanks to the efficient, fast computation algorithms developed by computer scientists, we are able to estimate millions of unknown parameters in very little time.
Though the neural network model is easy to understand, it comes with a set of new terminologies. Based on the neural network model, we will set up the architecture (model specification) with an input layer, hidden layers and an output layer, and apply the package to several case studies. The well-known MNIST case study will be carried out here with a basic NN.
Since the most successful applications of deep learning (neural networks with more layers) are image recognition and language processing, among others, we will also explore the image-specialized Convolutional Neural Network (CNN) and see how it performs compared to a regular neural network on MNIST. The core of a CNN is the convolutional layer and the pooling layer.
While keeping this lecture compact with the core ideas about NNs, we have left many details out. If you are interested in learning more, you can refer to this wonderful Neural Networks course, from which we use some graphic illustrations. A well-organized website with data mining elements, including neural networks, can be found HERE.
Acknowledgment: This lecture was built gradually and expanded into its current, readily implementable shape. Several former TAs have contributed to the current version; it would have been impossible without their help. Junhui Cai (current Ph.D. student in the stat dept), John Brooke Waldt (former MS student in the CS department) and Ryan Galea (former MBA student): I thank them!
Note:
Python is another commonly used programming language, especially in the machine learning community. Many popular libraries, including TensorFlow and Keras, which we will use in this lecture, are developed in Python. Hence, we need to install Python first and use R interfaces to interact with those libraries. Here’s the link to the Python Anaconda distribution (version). Anaconda is a package manager and provides a one-stop shop to install Python (and R) and some of the most popular libraries for data science. Anaconda Python is also optimized for numerical computation (because it is built against Intel’s math kernel library). Note that Windows users, in order to use Keras v2.2.4, have to install the Anaconda distribution; Mac/Linux users can use either Anaconda Python or vanilla Python.
TensorFlow is a machine learning library developed by Google. Keras is a high-level Application Programming Interface (API) for TensorFlow to build and train deep learning models. It runs on top of TensorFlow and is used for fast prototyping, advanced research, and production. An R interface to Keras by RStudio is available. Here is an official installation guide.
To install keras run the following chunk:
devtools::install_github("rstudio/keras")
library(keras)
install_keras() # a cpu based tensorflow installed
#install_keras(tensorflow = "gpu")

For further assistance see: https://keras.rstudio.com/ - detailed functions can be found in /API
We have discussed various machine learning methods in previous lectures. Deep learning is another one of these approaches. It attempts to learn, from data, successive layers of increasingly meaningful representations.
Let’s consider a classification problem. For each sample \((X_i, Y_i)\) where \(i = 1, \ldots, n\), we have
Given the features/predictors, we would like to build a probability model for a categorical response y, so that we can predict y accordingly, or build a predictive equation for a continuous response y.
Logistic regression is the most popular model for the probability of one level of y given all the predictors. For example, suppose we have 3 covariates \((x_1, x_2, x_3)\) and a label \(y \in \{0,1\}\). We model the probability \(P(y=1|x_1, x_2, x_3)\) through a linear logit link function:
\[logit(Prob(Y=1 | X_1, X_2, X_3)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \]
Or equivalently, this is the same as modeling \[Prob(Y=1 | X_1, X_2, X_3) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}} {1+ e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}}\]
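To make this concrete, here is a tiny R sketch of the logistic model above. The coefficient values are made up for illustration, not fitted to any data.

```r
# Hypothetical coefficients (b0, b1, b2, b3) -- for illustration only
beta <- c(-1, 0.5, 0.3, -0.2)
x <- c(2, 1, 0.5)                      # one observation (x1, x2, x3)

eta <- beta[1] + sum(beta[-1] * x)     # the linear logit: 0.2 here
p <- exp(eta) / (1 + exp(eta))         # Prob(Y = 1 | x1, x2, x3)
p                                      # about 0.55
plogis(eta)                            # same value via the built-in logistic CDF
```

In practice the coefficients would come from `glm(y ~ x1 + x2 + x3, family = binomial)`; the neural networks below generalize exactly this computation.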
Let us rephrase the modeling process as the following neural network with only an input layer and an output layer.
The following figure illustrates the model above.
Logistic regression as 0-layer neural network
We only need to estimate one set of unknown parameters, \(b_{1}^{(1)}, W_{11}^{(1)}, W_{12}^{(1)}, W_{13}^{(1)}\), since \[P(Y=1) + P(Y=0) = 1 \]
The above classical logistic regression model is a simple network:
0 layer
softmax as output (which reduces to the logistic function in the two-class case)
classification is done through thresholding the probability
it is simple and we can interpret the probability model
BUT, it is restrictive in its model form.
A natural way to expand the logistic regression model is to allow non-linear equations for \(s_1\) and \(s_2\). One way to do so is to create many intermediate composite functions sequentially, layer by layer from the input to the output layer. This is called a neural network.
A neural network is easily described by a diagram, or so-called architecture, which shows the model between input and output. A neural network consists of three major kinds of layers: the input layer, hidden layers and the output layer. By convention, when we say an \(N\)-layer neural network, we do not count the input layer.
As a simple example, consider the setup \((x_1, x_2, x_3, y)\), with 3 features and one binary response. Let’s start with a neural network whose single hidden layer consists of 4 neurons. We have 3 features for each data point and 2 classes to predict.
1-layer Neural network model
For each data point \((x,y)\),
Input layer consists of 3 inputs units (3 features): \(x_1, x_2, x_3\)
Hidden layer consists of 4 neurons. Each neuron is given by first taking a linear combination then apply an activation function \(f_1\) \[a_1^{(1)} = f_1(b_{1}^{(1)} + W_{11}^{(1)}x_1 + W_{12}^{(1)}x_2 + W_{13}^{(1)}x_3)\] \[a_2^{(1)} = f_1(b_{2}^{(1)} + W_{21}^{(1)}x_1 + W_{22}^{(1)}x_2 + W_{23}^{(1)}x_3)\] \[a_3^{(1)} = f_1(b_{3}^{(1)} + W_{31}^{(1)}x_1 + W_{32}^{(1)}x_2 + W_{33}^{(1)}x_3)\] \[a_4^{(1)} = f_1(b_{4}^{(1)} + W_{41}^{(1)}x_1 + W_{42}^{(1)}x_2 + W_{43}^{(1)}x_3)\] where \(a_i^{(1)}\) denotes the \(i\)-th activated neuron in hidden layer 1.
\(f_1(x)\) is called an activation function. Here we take ReLU function: \[f_1(x) = \max (0, x) \]
Output layer consists of 2 output units \[s_1 = b_{1}^{(2)}+W_{11}^{(2)}a_1^{(1)} + W_{12}^{(2)}a_2^{(1)} + W_{13}^{(2)}a_3^{(1)} + W_{14}^{(2)}a_4^{(1)}\] \[s_2 = b_{2}^{(2)}+W_{21}^{(2)}a_1^{(1)} + W_{22}^{(2)}a_2^{(1)} + W_{23}^{(2)}a_3^{(1)} + W_{24}^{(2)}a_4^{(1)}\]
Softmax models: \[Prob(Y=1|x_1, x_2, x_3) = \frac{e^{s_1}}{e^{s_1}+e^{s_2}} = \frac{e^{b_{1}^{(2)}+W_{11}^{(2)}a_1^{(1)} + W_{12}^{(2)}a_2^{(1)} + W_{13}^{(2)}a_3^{(1)} + W_{14}^{(2)}a_4^{(1)}}}{e^{s_1}+e^{s_2}}\] \[Prob(Y=0|x_1, x_2, x_3) = \frac{e^{s_2}}{e^{s_1}+e^{s_2}}=\frac{e^{b_{2}^{(2)}+W_{21}^{(2)}a_1^{(1)} + W_{22}^{(2)}a_2^{(1)} + W_{23}^{(2)}a_3^{(1)} + W_{24}^{(2)}a_4^{(1)}}}{e^{s_1}+e^{s_2}} \]
Unknown parameters: all the weights \(W^{(1)}\), \(W^{(2)}\) and biases \(b^{(1)}\), \(b^{(2)}\) are unknown parameters.
Written compactly, the unknowns are: in the hidden layer, \(W^{(1)}\) is a \(3\times 4\) matrix and \(b^{(1)}\) is \(4\times 1\); in the output layer, \(W^{(2)}\) is \(4\times 1\) and \(b^{(2)}\) is \(1\times 1\) (only one output score needs to be estimated since the two class probabilities sum to one), for a total of \(12+4+4+1=21\) parameters.
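As a quick sanity check, the parameter count above can be tallied directly; this is just the arithmetic from the previous paragraph written out in R.

```r
# Parameter count for the 3-input, 4-neuron, single-output-score network above
n_in <- 3; n_hidden <- 4; n_out <- 1
hidden_params <- n_in * n_hidden + n_hidden  # W^(1) is 3x4, b^(1) is 4x1: 12 + 4
output_params <- n_hidden * n_out + n_out    # W^(2) is 4x1, b^(2) is 1x1: 4 + 1
hidden_params + output_params                # 21
```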
This model is different from the classical logistic regression model since we have introduced the non-linear part in the model.
To summarize, a general neural network or architecture consists of:
Input layer: holds the input training data
Hidden layer (fully-connected layer): computes the neurons according to a weight matrix \(W\) plus an intercept term. We need to specify the number of hidden layers, the number of neurons in each hidden layer and the activation function.
Output layer: contains the class scores for classification. The dimension is usually the same as the number of classes.
The neural network described above is flexible enough to handle:
y with \(K\) levels: we only need to output \(s_1, s_2, \ldots, s_K\), one for each outcome.
y continuous: output a single \(s_1\).
A commonly used activation function can be any of the following non-linear functions:
Rectified Linear Unit (ReLU, pronounced REL-loo) - quicker to train!
\[f=\max(0, x)\]
Logit/Sigmoid:
\[f = \frac{1}{1+e^{-x}}\]
Hyperbolic Tangent: - Similar to logit \[f=\tanh(x)\]
The most commonly used activation function is ReLU because it converges faster and is simple to implement (by thresholding). The leaky ReLU, \(f=\max(\epsilon x, x)\), is also used in very deep networks.
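These activation functions are one-liners in R; here is a quick sketch (the 0.01 slope for the leaky ReLU is a typical but arbitrary choice for \(\epsilon\)):

```r
relu       <- function(x) pmax(0, x)                   # ReLU: max(0, x)
sigmoid    <- function(x) 1 / (1 + exp(-x))            # logit/sigmoid
leaky_relu <- function(x, eps = 0.01) pmax(eps * x, x) # leaky ReLU

x <- c(-2, -0.5, 0, 0.5, 2)
relu(x)        # negative inputs are zeroed out: 0 0 0 0.5 2
sigmoid(x)     # squashed into (0, 1); sigmoid(0) = 0.5
tanh(x)        # built-in; squashed into (-1, 1)
leaky_relu(x)  # small negative slope instead of a hard zero
```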
The following figures show the shape of activation functions
The basic computational unit of the brain is a neuron, as shown in the first figure below. Each neuron receives signals (impulses) from other neurons, processes them, and produces output signals to other neurons. Analogously, a neural network receives the input matrix, processes the input, and outputs the signal through an activation function (\(h\) in the picture, or \(f\) in our notation).
Once a model is specified or an architecture is defined, we then need to estimate all the weights and biases (all the unknown parameters). The standard loss for classification is the cross entropy, which is exactly the negative log-likelihood. We then minimize the cross entropy (equivalently, maximize the log-likelihood) to get all the unknown parameters.
Logistic loss classifier \[L = - \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^K y_{ij} \log(p_{ij})\] where \(y_{ij} = 1\) if observation \(i\) is in class \(j\) and 0 otherwise, and \(p_{ij}\) is the probability of \(i\) being in class \(j\).
Softmax classifier: one of the most popular choices. Denote the scores in the output layer for observation \(i\) as \((s_{i1}, \ldots, s_{iK})\). Then the loss function is \[L = - \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^K y_{ij} \log\Bigg(\frac{e^{s_{ij}}}{\sum_{k=1}^K e^{s_{ik}} } \Bigg)\]
Remark: For the softmax we take \[p_j=Prob(Y=j|x_1, \ldots, x_p) = \frac{e^{s_j}}{\sum_{k=1}^{K} e^{s_k}}\] where \(K\) is the number of classes.
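A small R sketch of the softmax probabilities and the resulting cross-entropy term for one observation (the scores here are made up):

```r
softmax <- function(s) {
  s <- s - max(s)          # subtracting the max avoids overflow; probabilities unchanged
  exp(s) / sum(exp(s))
}

s <- c(2, 0.5, -1)         # hypothetical output scores for K = 3 classes
p <- softmax(s)
sum(p)                     # 1: the scores become a probability distribution
-log(p[1])                 # the cross-entropy contribution if the true class is 1
```

The loss \(L\) above is just the average of these \(-\log(p)\) terms over all \(n\) observations.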
As we have seen, the number of parameters grows fast as we specify more hidden layers. Over-parameterization can lead to overfitting. There are two major ways to regularize the model.
To fit the model, we find the minimizers over all the coefficients, as we did in glm. Here the number of parameters is huge, so fast, efficient algorithms are needed. This is where TensorFlow comes in: gradient descent combined with the trick of back-propagation does the job. (You may skip the remaining paragraphs, which are rather technical notes about the minimization problem.)
In this course, we barely mention how packages actually solve the optimization problems. Optimization is a very general yet crucial field that has been developing for decades. Neural networks and deep learning could not have become so popular without recent developments in optimization (and, of course, computational power). Refer to an excellent convex optimization course if you are interested.
The most important optimization techniques for neural networks and deep learning are gradient descent and back-propagation. Neither idea is as new as neural networks themselves! Gradient descent was invented by Cauchy in 1847, and back-propagation was popularized by Rumelhart, Hinton (one of the three 2018 Turing Award laureates) and Williams (17k citations!) in 1986 (back-propagation was derived by multiple researchers in the early 60’s).
Gradient descent is an algorithm to find a maximizer/minimizer using first-order information, i.e. the gradient. There have been many developments in recent years to accelerate it (e.g. stochastic gradient descent (SGD), momentum, adaptive gradient descent (Adagrad), RMSprop). For neural networks, it is worth mentioning that since the loss function is generally not convex, gradient descent does not guarantee global convergence and is often trapped at saddle points. Research on escaping saddle points and, in general, non-convex optimization is very active.
There exist other algorithms, for example Newton’s method, which uses second-order information, i.e. the Hessian. Second-order methods enjoy a faster convergence rate. In practice, however, computing the Hessian for millions of parameters is very expensive, hence it is not common to apply second-order methods to large-scale deep learning.
Back-propagation is a technique to calculate the gradients needed for gradient descent. It is a simple yet powerful idea based on the chain rule. Each iteration needs to update millions of parameters and thus to compute millions of gradients. Back-propagation starts from the last layer and computes the gradients for the parameters of that layer, then backprops (passes the gradients) to the previous layer to compute the gradients for the second-to-last layer, and so on. Back-propagation thus allows us to efficiently compute the gradients and then update the parameters.
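The chain-rule bookkeeping can be seen in a minimal sketch: a 1-hidden-layer network with ReLU and a squared loss (a simpler loss than cross entropy, chosen only to keep the gradients short). All numbers are made up, and the last lines verify one gradient entry against a finite difference.

```r
set.seed(1)
x  <- matrix(c(1, 2, 3))                        # one input with 3 features
y  <- 1                                         # scalar target
W1 <- matrix(rnorm(12), 4, 3); b1 <- rnorm(4)   # hidden layer: 4 neurons
W2 <- matrix(rnorm(4), 1, 4);  b2 <- rnorm(1)   # output layer: 1 score

# Forward pass, saving intermediates
z1   <- W1 %*% x + b1                   # pre-activations
a1   <- pmax(0, z1)                     # ReLU
yhat <- as.numeric(W2 %*% a1 + b2)
loss <- (yhat - y)^2

# Backward pass: chain rule from the last layer to the first
d_yhat <- 2 * (yhat - y)                # dL/dyhat
d_W2   <- d_yhat * t(a1)                # gradients for the output layer
d_b2   <- d_yhat
d_a1   <- t(W2) * d_yhat                # gradient passed back to the hidden layer
d_z1   <- d_a1 * (z1 > 0)               # ReLU derivative: 1 if z1 > 0, else 0
d_W1   <- d_z1 %*% t(x)                 # gradients for the hidden layer
d_b1   <- as.numeric(d_z1)

# Numerical check of d_W1[1, 1] by a finite difference
eps <- 1e-6
W1p <- W1; W1p[1, 1] <- W1p[1, 1] + eps
lossp <- (as.numeric(W2 %*% pmax(0, W1p %*% x + b1) + b2) - y)^2
c(analytic = d_W1[1, 1], numeric = (lossp - loss) / eps)  # should agree closely
```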
A detailed explanation of GD/SGD is provided in the Appendix!
We are now ready to use Keras to implement neural networks with real case studies.
How well can we predict ratings from a review? The Yelp data available to us here consists of 100,000 reviews together with ratings. Using a bag of words, each review has already been processed into term-frequency format, with \(p=1072\) words retained. The response variable is a binary rating, 1 being a good review and 0 being a bad review.
In this section we implement a neural network model with two layers and a varying number of neurons. The package keras is used.
Read the data: YELP_tm_freq.csv is the processed Yelp data. It contains the response rating and the reviews. We extract data3: its first variable is the response rating and the remaining variables are the word frequencies, i.e. the inputs \(x_1, \ldots, x_{1072}\).
# Data prep
library(data.table) # provides fread()
data2 <- fread("YELP_tm_freq.csv") # read the term-frequency table
names(data2)[c(1:5, 500:510)] # notice that user_id, stars and date are in the data2
dim(data2)
data3 <- data2[, -c(1:3)]; dim(data3)# the first element is the rating
names(data3)[1:3]
levels(as.factor(data3$rating))

Data preparation for NN:
Validation data: data3_val: reserve 10,000
# Split data
set.seed(1) # for the purpose of reproducibility
n <- nrow(data3)
validation.index <- sample(n, 10000)
length(validation.index) # reserve 10000
data3_val <- data3[validation.index, ] #
## validation input/y
data3_xval <- as.matrix(data3_val[, -1]) # make sure it is a matrix
data3_yval <- as.matrix(data3_val[, 1]) # make sure it is a matrix

Training data: data3_xtrain/data3_ytrain: use the remaining 90,000 reviews as the training data. Internally we need to split this training data again to tune the parameters. Keras refers to these internal split datasets as training/validation. (We have been calling this pair training/testing.)
## training input/y: need to be matrix/vector
data3_xtrain <- data3[-validation.index, -1] #dim(data3_xtrain)
data3_ytrain <- data3[-validation.index, 1]
data3_xtrain <- as.matrix(data3_xtrain) # make sure it is a matrix
data3_ytrain <- as.matrix(data3_ytrain) # make sure it is a matrix

Now that the data is prepared, let’s begin working with our full dataset using the keras package. The first step is to define our model. By defining, we mean specifying the number and type of layers to include in our model and what will happen to our data in each layer. For our purposes keras_model_sequential() is the function to run when defining our model.
To add a fully connected or dense layer we use the command layer_dense().
When building the layer, the shape of the input needs to be specified. This refers to the length of each input vector, which in our case is 1072 (the 1072 possible words in our frequency dictionary).
We also need to specify how many hidden units our layer will contain (these are represented by the nodes or neurons in the earlier graphic). Having more hidden units (a higher-dimensional representation space) allows your network to learn more-complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns. Here we use 16 neurons in the first layer and 8 neurons in the second layer.
Finally, we need to specify the activation function. Here we use the rectified linear unit or ReLU function mentioned earlier. This has the effect of zeroing out negative values (equivalent to \(\max(0, X)\)).
The final layer of the model specifies our output. As this is a classification problem, we want \(P(Y=1)\). To output a probability we could specify the activation to be the “sigmoid” function (which, as you recall, is just the familiar logistic function); with two output units, as in the code below, the equivalent “softmax” activation is used.
Define the model/architecture:
+ two layers, with 16 and 8 neurons respectively
+ activation function: ReLU
+ output layer: softmax (equivalent to a sigmoid in the two-class case)
# set seed for keras
use_session_with_seed(10)
## Set session seed to 10 (disabled GPU, CPU parallelism)
p <- dim(data3_xtrain)[2] # number of input variables
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = c(p)) %>%
# 1 layer with 16 neurons. default activation is relu
layer_dense(units = 8, activation = "relu") %>%
# layer 2 with 8 neurons
layer_dense(units = 2, activation = "softmax") # output
print(model)
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense (Dense) (None, 16) 17168
## ___________________________________________________________________________
## dense_1 (Dense) (None, 8) 136
## ___________________________________________________________________________
## dense_2 (Dense) (None, 2) 18
## ===========================================================================
## Total params: 17,322
## Trainable params: 17,322
## Non-trainable params: 0
## ___________________________________________________________________________
As we can see in the table above, our model has a total of 17,322 parameters.
Next we need to compile our model. This step specifies the optimization algorithm we want to use (we will always use ‘rmsprop’), the loss function (since the labels are integer classes fed to a two-unit softmax, we use ‘sparse_categorical_crossentropy’) and the metric used to evaluate performance (here we use accuracy, the fraction of reviews that are correctly classified).
Compile the Model
##Compile the Model
model %>% compile(
optimizer = "rmsprop",
loss = "sparse_categorical_crossentropy",
metrics = c("accuracy")
)

Before we fit our model, let us explain how the model is trained internally. The data will be split into an internal training set and an internal validation set.
For example we may want to use 85% of the data as an internal training data and the remaining data as a validation data by specifying validation_split = 0.15. The training data will be used to get all the parameters and the loss and accuracy will be evaluated by the internal validation data.
We can now fit our model. The batch size refers to the number of samples per gradient update; this is part of the trick of updating the gradients with one batch of reviews at a time. The epoch number refers to the number of passes over the training data; in each epoch we cycle through all the data points, one batch at a time. For test purposes we use 20 epochs.
Note that compile() and fit() modify the model object in place, unlike most R functions. So after we compile and fit model, model has already changed. If we call fit() again on a fitted model, it will continue training from the current weights rather than starting from scratch. The output of the fit() function is the history (accuracy and loss) of the training process. So fit1 in the following chunk stores the training history but NOT the model object.
fit1 <- model %>% fit(
data3_xtrain,
data3_ytrain,
epochs = 20,
batch_size = 512,
  validation_split = .15 # set 15% of data3_xtrain, data3_ytrain aside as the validation data
)
plot(fit1)
## `geom_smooth()` using formula 'y ~ x'
All 17,322 parameters: with our model now fit, let’s take a look at their values.
weights <- model %>% get_weights()
# str(weights) # show each layers of W's and b's
# hist(weights[[1]]) # W^(1)
# weights[[2]] # b's for layer 1

Predictions: let’s see if we can gain further intuition by manually using the above parameter values for prediction. Below are the predicted probabilities from the model for the first five reviews in our training set.
round(model %>% predict(data3_xtrain[1:5,]), 3)
##       [,1]  [,2]
## [1,] 0.016 0.984
## [2,] 0.258 0.742
## [3,] 0.654 0.346
## [4,] 0.005 0.995
## [5,] 0.464 0.536
Let’s see if we can manually compute these probabilities for the first five reviews in our training set using the computed weights shown previously. If you understand this chunk, you have mastered the basic elements of a NN.
n5 <- 5
# first layer: z_1 = W_1 X + b_1; a_1 = ReLU(z_1)
z_1 <- data3_xtrain[1:n5, ] %*% weights[[1]]
# add beta (weights[[2]]) to every row
z_1 <- z_1 + matrix(rep(weights[[2]], n5), nrow = n5, byrow = T)
a_1 <- matrix(pmax(0, z_1), nrow = n5)
# second layer: z_2 = W_2 a_1 + b_2; a_2 = ReLU(z_2)
z_2 <- a_1 %*% weights[[3]]
z_2 <- z_2 + matrix(rep(weights[[4]], n5), nrow = n5, byrow = T)
a_2 <- matrix(pmax(0, z_2), nrow = n5)
# output layer: softmax(W_3 a_2 + b_3)
z_out <- a_2 %*% weights[[5]]
z_out <- z_out + matrix(rep(weights[[6]], n5), nrow = n5, byrow = T)
prob.pred <- t(apply(z_out, 1, function(z) exp(z)/sum(exp(z))))
round(prob.pred, 3)
##       [,1]  [,2]
## [1,] 0.016 0.984
## [2,] 0.258 0.742
## [3,] 0.654 0.346
## [4,] 0.005 0.995
## [5,] 0.464 0.536
It’s really that simple! The model then continues computing the probabilities for the remainder of the reviews in our training set and feeds these numbers into our loss function. The value of the loss function on our training data is as follows:
fit1$metrics$loss[20] # fit1$metrics keeps all the evaluations
## [1] 0.3246938
From the graph below we see that by about 6 epochs our validation loss has bottomed out and we receive no further benefit from additional iterations.
plot(fit1)
## `geom_smooth()` using formula 'y ~ x'
To avoid overfitting, let’s use 6 epochs in our final model. (We could also play with the number of units and the batch size to see whether they have an impact.) Now that we are ready to fit our final model, we can use all of the training data.
Final training with all the training data:
Here we have put all the steps together to get the final NN predictive equation:
* Training data: data3_xtrain, data3_ytrain
* Validation data: data3_xval, data3_yval
* NN model: two layers with 16 and 8 neurons; ReLU activations; softmax output
* Epochs: 6
p <- dim(data3_xtrain)[2] # number of input variables
#retain the nn:
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = c(p)) %>%
# 1 layer with 16 neurons. default activation is relu
layer_dense(units = 8, activation = "relu") %>% # layer 2 with 8 neurons
layer_dense(units = 2, activation = "softmax") # output
model %>% compile(
optimizer = "rmsprop",
loss = "sparse_categorical_crossentropy",
metrics = c("accuracy")
)
model %>% fit(data3_xtrain, data3_ytrain, epochs = 6, batch_size = 512)

We can use the evaluate function to assess our performance on the validation data and report the final performance of the model built.
results <- model %>% evaluate(data3_xval, data3_yval) ; results
## $loss
## [1] 0.4193319
##
## $acc
## [1] 0.8084
Our accuracy on the validation data is an impressive 81% (a 19% misclassification error) with only two fully connected layers! That is, we correctly classified 81% of the reviews as positive or negative. With additional layers and different types of layers we could improve on this even further.
Finally, we can make predictions. Let us see how well we predict the first 5 reviews in the validation data.
Get probabilities:
pred.prob <- model %>% predict(data3_xval[1:5,])
pred.prob
##        [,1]      [,2]
## [1,] 0.3160712 0.6839288
## [2,] 0.8966123 0.1033877
## [3,] 0.7350608 0.2649392
## [4,] 0.2509769 0.7490231
## [5,] 0.4209864 0.5790135
Get the labels:
y.pred <- model %>% predict_classes(data3_xval[1:5,]) # the class with the largest probability
y.pred
## [1] 1 0 0 1 1
data.frame(yhat=y.pred, y=data3_yval[1:5, 1])
##   yhat y
## 1 1 0
## 2 0 0
## 3 0 0
## 4 1 1
## 5 1 1
We have introduced neural networks and implemented one using keras. A neural network provides a non-linear model to predict response variables, which can be categorical or continuous.
We have run the Yelp review case study. Since we used a bag of words to process the data, the advantage of neural nets is not apparent here.
In the next section we will apply NNs to an image recognition case study, where the prediction accuracy is much better.
Objective: classify handwritten digits from 0 to 9 correctly.
The data is available with keras package and it can be loaded as follows:
mnist <- dataset_mnist()
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist

Let’s take a look at the data structure. The x’s are images of 28x28 pixels, and the y labels indicate which digit each image actually is.
dim(train_images)
## [1] 60000    28    28
#print(train_images[1, , ]) # the pixels are described in a matrix of 28 by 28 already
dim(train_labels)
## [1] 60000
levels(as.factor(train_labels)) # 10 digits
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
table(as.factor(train_labels)) # frequencies
##
## 0 1 2 3 4 5 6 7 8 9
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
Let’s take a look at the first 30 images in the training set.
index <- 1:30
par(mfcol = c(5,6), mar = rep(1, 4), oma = rep(0.2, 4))
mnist$train$x[index,,] %>%
purrr::array_tree(1) %>%
purrr::set_names(mnist$train$y[index]) %>%
purrr::map(as.raster, max = 255) %>%
  purrr::iwalk(~{plot(.x); title(.y)})

Images are just a matrix of pixels. Usually we use three color channels: red, green and blue. Each pixel is a 3-tuple that indicates how much we want of (red, green, blue), each value ranging from 0 to 255. For example, (255, 0, 0) represents red and (255, 255, 0) represents yellow. Black and white are represented as (0,0,0) and (255,255,255) respectively. For black-and-white pictures there is only one color channel, so each pixel has a single number indicating darkness, which is called greyscale. To summarize,
We can vectorize the 2D matrix using the function array_reshape() as follows:
train_images <- array_reshape(train_images, c(60000, 28*28)) #correct shape to what is expected by keras
#train_images[1, ]
train_images <- train_images /255 #make values between 0 and 1
dim(train_images)
## [1] 60000   784
###Vectorize the data
test_images <- array_reshape(test_images, c(10000, 28*28))
test_images <- test_images / 255
###Vectorize the labels
train_labels <- to_categorical(train_labels) #convert to vector of length of labels, with 0 and 1s to represent if label value is true or false
test_labels <- to_categorical(test_labels)

It is an easy extension to handle multiple classes from the binary setting. We will use the softmax function to model the probability of y being in each class j.
We are now ready to build a NN that first estimates the probability of each class for each input, then assigns the label with the highest probability.
The setup for a multiclass classification neural network is very similar to what we did for binary classification, with a few notable differences. First, each raw input now has two dimensions (a 28x28 matrix representing the height and width of each image), which we have flattened into a 784-vector. To save time, the optimal values for the hyperparameters have been identified using the methods described previously. The number of output units is 10, as we have 10 possible labels: 0 to 9.
The compiler settings are all the same, except that for our loss function we use categorical cross entropy rather than binary cross entropy. This is similar to the binary implementation, except that it allows for multiple classes.
We start with a NN model with 1 hidden layer of 512 neurons, activated with the ReLU function. The probability of each outcome is obtained via a softmax transformation.
##Define the model
model <- keras_model_sequential() %>%
layer_dense(units = 512, activation = 'relu', input_shape = c(28*28)) %>%
layer_dense(units = 10, activation = 'softmax')
##Compile the model
model %>% compile(
optimizer = "rmsprop",
loss = "categorical_crossentropy",
metrics = c("accuracy")
)
##fit the model
mod_fit <- model %>% fit(
#train_images, train_labels, epochs = 40, batch_size=128, validation_split = 0.2 # epoths = 10 is enough
train_images, train_labels, epochs = 10, batch_size=128, validation_split = 0.2
)
plot(mod_fit)
## `geom_smooth()` using formula 'y ~ x'
Let’s see how we perform on the test data:
results <- model %>% evaluate(test_images, test_labels)
results
## $loss
## [1] 0.07858416
##
## $acc
## [1] 0.9781
Our accuracy is an amazing 98%! Meaning that we correctly classified the digit images 98% of the time!
With our model trained, let’s have some fun. Open up MS Paint or similar paint software and do the following:
Note the following: prior to running our image through the model, we need to pre-process the image into the same format as the samples used to fit our model. These samples are 28x28, with the digit centered in a 20x20 box (i.e. there is a 4-pixel white-space border around the digit).
library(imager)
image <- load.image('img/Untitled.png') ;
plot(image) #Load image and plot
##If image has white background with black text do (1-image below) to invert colors, otherwise replace (1-image) with (image)
image <- (1- image) %>% autocrop() %>% grayscale() #Make grayscale and crop out white space
image <- imager::imresize(image, min(20/height(image), 20/width(image))) #Resize image to fit in 20 x 20 box
image <- pad(image,28-width(image), pos = 0, "x") #Center image and create top and bottom border
image <- pad(image,28-height(image), pos = 0, "y")#Center image and create border left and right border
image <- resize(image, size_y = 28, size_x = 28) #Ensures that dimensions are 28 x 28
plot(image)
tester <- aperm(image, c(4,3,2,1)) #Inverse array format to be consistent with input format
image_test <- array_reshape(tester, c(1, 28*28)) #Convert to vector to be consistent with input format
prediction <- model %>% predict(image_test) # round(prediction, 2), almost with 100% sure to predict the image being a 9
paste("The number you drew was a", which.max(prediction[,]) - 1)

Hooray, we predicted it right: an awesome 9!
We have stretched the pixels into a vector of 784 dimensions, so clearly the spatial structure is not preserved. In a separate lecture we introduce the convolutional neural network, which keeps the local information at each pixel as input. It works much better for image data.
We took a glimpse at neural networks in this lecture. We learned how to model the human perception process using a neural network. The basic elements of the neural network architecture are neurons and layers. There are many tuning parameters and thousands or millions of parameters to estimate.
Using the keras package, we applied neural networks to the Yelp reviews and MNIST. Since MNIST consists of images, we learned how to handle image data. We also made a real prediction using the model we trained.
Let’s compare the performance of our neural network to the other machine learning approaches we have learned.
Let’s start with something simple, a quadratic function \(f(\theta) = \theta^2\). We want to find the minimizer that minimizes \(f(\theta)\). We know that when \(\theta = 0\), \(f(\theta) = 0\) and it reaches the minimal value. So \(\theta^* = 0\) is our minimizer.
# quadratic function
square <- function(theta) theta^2
theta_seq <- seq(-2,2,.1)
plot(theta_seq, square(theta_seq), type = 'l')

How do we find the minimizer \(\theta^* = 0\) using gradient descent? We use the gradient/derivative to guide us in the direction that moves toward the minimum, taking a small step each time.
In this example, the derivative is \(f'(\theta) = 2\theta\). At every step, we move by \(-\alpha f'(\theta)\), where \(\alpha\) is the learning rate. Let’s take \(\alpha = 0.1\) for now. If we initiate \(\theta^{(0)} = 2\), then the derivative at \(\theta^{(0)} = 2\) is \(2\theta = 2\times 2 = 4\). The first step is \(\theta^{(1)} = \theta^{(0)} -\alpha f'(\theta^{(0)}) = 2 - 0.1\times 4 = 1.6\). We iterate until \(|f'(\theta^{(t)})| < \epsilon\), where \(\epsilon\) is a very small number, which means \(\theta^{(t)}\) is very close to \(\theta^*\).
Algorithm 1: Gradient descent for one parameter
Initiate \(\theta^{(0)}\)
While (\(|f'(\theta^{(t)})| > \epsilon\))
\[\theta^{(t+1)} \leftarrow \theta^{(t)} -\alpha f'(\theta^{(t)})\]
The following chunk is to illustrate the gradient descent method. You may skip it.
# set up learning rate
learning_rate <- 0.1

Now try a different learning rate and see how it affects the convergence (especially when the learning rate is 1).
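As a minimal sketch of Algorithm 1 applied to \(f(\theta) = \theta^2\) (the helper name `grad_descent`, the tolerance `eps`, and the iteration cap are our own illustrative choices, not part of the text above):

```r
# Gradient descent on f(theta) = theta^2, whose derivative is f'(theta) = 2*theta
grad_descent <- function(theta0, alpha = 0.1, eps = 1e-6, max_iter = 1000) {
  theta <- theta0
  path <- theta                   # record the path of iterates
  for (t in 1:max_iter) {
    grad <- 2 * theta             # f'(theta) = 2*theta
    if (abs(grad) < eps) break    # stop once the derivative is near zero
    theta <- theta - alpha * grad # theta^(t+1) <- theta^(t) - alpha * f'(theta^(t))
    path <- c(path, theta)
  }
  path
}

path <- grad_descent(theta0 = 2, alpha = 0.1)
head(path, 3)  # 2, 1.6, 1.28 -- matching the hand computation above
tail(path, 1)  # very close to the minimizer theta* = 0
```

With \(\alpha = 0.1\) each iterate shrinks by a factor of \(1 - 2\alpha = 0.8\), so the sequence contracts geometrically toward 0; with \(\alpha = 1\) the update overshoots and oscillates between \(2\) and \(-2\) forever.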
Now we move on to a more realistic case: OLS. Recall that the loss function for OLS is \[ L(x,y; \theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta^\top x_i)^2 \]
For OLS we can write out the explicit solution, so we do not need gradient descent to solve OLS. However not every optimization problem has an explicit solution. Using gradient descent to solve OLS is for pedagogical reason.
Let’s consider only two variables \(x_1\) and \(x_2\) and the response \(y = \beta_1 x_1 + \beta_2 x_2 + \epsilon\). We want to use gradient descent to estimate \(\beta_1\) and \(\beta_2\). The loss function is \[ L(x,y; \theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta_1 x_{1,i} - \theta_2 x_{2,i})^2 \]
The gradient is \[\begin{align} \frac{\partial L}{\partial \theta_1} &= - \frac{2}{n} \sum_{i=1}^n x_{1,i} (y_i - \theta_1 x_{1,i} - \theta_2 x_{2,i})\\ \frac{\partial L}{\partial \theta_2} &= - \frac{2}{n} \sum_{i=1}^n x_{2,i} (y_i - \theta_1 x_{1,i} - \theta_2 x_{2,i}) \end{align}\]
Then the update step is \[\begin{align} \theta_1^{(t+1)} &\leftarrow \theta_1^{(t)} + \alpha \frac{2}{n} \sum_{i=1}^n x_{1,i} (y_i - \theta_1^{(t)} x_{1,i} - \theta_2^{(t)} x_{2,i})\\ \theta_2^{(t+1)} &\leftarrow \theta_2^{(t)} + \alpha \frac{2}{n} \sum_{i=1}^n x_{2,i} (y_i - \theta_1^{(t)} x_{1,i} - \theta_2^{(t)} x_{2,i}) \end{align}\]
To write it in a compact form \[ \boldsymbol\theta^{(t+1)} \leftarrow \boldsymbol\theta^{(t)} - \alpha \nabla L(\boldsymbol\theta^{(t)} ) \] where \(\alpha\) is learning rate and \(\nabla L(\boldsymbol\theta)\) is the gradient for the loss function \(L(\boldsymbol\theta)\).
Algorithm 2: Gradient descent for multiple parameters
Initiate \(\theta^{(0)}\)
While (\(\|\nabla L(\boldsymbol\theta^{(t)})\|_2 > \epsilon\))
\[\boldsymbol\theta^{(t+1)} \leftarrow \boldsymbol\theta^{(t)} - \alpha \nabla L(\boldsymbol\theta^{(t)} )\]
Let’s take \(y = x_1 + 2 x_2 + \epsilon\) as an example, i.e., \(\beta_1 = 1\) and \(\beta_2 = 2\), and generate 100 observations \((x_{1,i}, x_{2,i}, y_i)\). We want to estimate \(\beta_1\) and \(\beta_2\) using the 100 observed \((x_{1,i}, x_{2,i}, y_i)\). To do so, we minimize our loss function (the RSS) by iteratively updating our estimates of \(\beta_1\) and \(\beta_2\) following Algorithm 2. The following illustrates the path of our estimates under Algorithm 2.
n <- 100
beta <- c(1,2)
x <- cbind(rnorm(n), rnorm(n))
y <- x %*% beta + rnorm(n, .1)

For complex problems, calculating the gradient using all the samples for each update can be computationally intensive. Stochastic gradient descent (SGD) provides a way to estimate the gradient using fewer samples per step so that the computation for each update is reduced.
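Here is a minimal sketch of Algorithm 2 on this simulated OLS problem. The learning rate, tolerance, and iteration cap are illustrative choices, and we regenerate the data with a fixed seed (noise with sd 0.1) so the chunk runs on its own:

```r
# Simulate y = x1 + 2*x2 + epsilon, i.e. beta = (1, 2)
set.seed(1)
n <- 100
x <- cbind(rnorm(n), rnorm(n))
y <- x %*% c(1, 2) + rnorm(n, sd = .1)

# Gradient descent (Algorithm 2): theta^(t+1) <- theta^(t) - alpha * grad L
alpha <- 0.1
theta <- c(0, 0)                          # initiate theta^(0)
for (t in 1:1000) {
  grad <- -(2 / n) * t(x) %*% (y - x %*% theta)
  if (sqrt(sum(grad^2)) < 1e-8) break     # stop when the gradient norm is tiny
  theta <- theta - alpha * grad
}
round(theta, 2)                           # close to (beta_1, beta_2) = (1, 2)
```

Note that the converged `theta` coincides with the explicit OLS solution \((X^\top X)^{-1} X^\top y\), which is why OLS makes a good sanity check for the algorithm.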
Note that the loss function is a sum, and so is the gradient. We can therefore take one or a few data points to estimate the full gradient. Taking one data point per update is called vanilla SGD, while taking batches of data points is called mini-batch SGD. However, SGD takes more iterations than GD to converge. There are many methods that improve SGD so that it converges nearly as fast as GD while remaining computationally efficient.
We usually see the convergence plot with the x-axis being the number of epochs and the y-axis being the training and testing error. One epoch means the full data set has been used once for updates. For example, in 10-batch SGD we split the data into 10 batches and update using each of the 10 batches in turn: we update 10 times, but it only counts as 1 epoch. The reason for using epochs instead of update counts is to make comparisons among optimization methods fair.
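The epoch bookkeeping above can be sketched on the same simulated OLS problem. This is a minimal mini-batch SGD illustration, not the optimizer keras uses internally; the batch count, learning rate, and epoch count are our own choices, and the data are regenerated with a fixed seed so the chunk runs on its own:

```r
# Simulate the same OLS data: y = x1 + 2*x2 + epsilon
set.seed(1)
n <- 100
x <- cbind(rnorm(n), rnorm(n))
y <- x %*% c(1, 2) + rnorm(n, sd = .1)

# Mini-batch SGD: split the n observations into 10 batches each epoch,
# so one epoch = 10 parameter updates but one pass over the full data
alpha <- 0.05
n_batch <- 10
theta_sgd <- c(0, 0)
for (epoch in 1:50) {
  idx <- sample(n)                                  # reshuffle each epoch
  batches <- split(idx, rep(1:n_batch, length.out = n))
  for (b in batches) {
    xb <- x[b, , drop = FALSE]
    yb <- y[b]
    grad <- -(2 / length(b)) * t(xb) %*% (yb - xb %*% theta_sgd)
    theta_sgd <- theta_sgd - alpha * grad           # one update per batch
  }
}
round(theta_sgd, 2)                                 # noisier than full GD, still near (1, 2)
```

Each batch gradient is an unbiased but noisy estimate of the full gradient, which is why the SGD path wiggles around the GD path yet lands in the same neighborhood.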
Let’s apply random forest using the ranger library to the MNIST dataset and see how it performs.
Read and restructure the data.
mnist <- dataset_mnist()
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist
train_images <- array_reshape(train_images, c(60000, 28*28)) #corrects shape to what is expected by keras
train_images <- train_images / 255 #make values between 0 and 1
###Vectorize the data
test_images <- array_reshape(test_images, c(10000, 28*28))
test_images <- test_images / 255
###Vectorize the labels
train_labels <- to_categorical(train_labels) # one-hot encode: each label becomes a length-10 vector of 0s with a single 1
test_labels <- to_categorical(test_labels)

Train the model.
# faster implementation of random forests
pvals <- 0:9
rf_y <- rep(0, nrow(train_labels))
for (i in 1:nrow(train_labels)) {
rf_y[i] <- pvals[train_labels[i,]==1]
}
rf_data <- data.frame(y = as.factor(rf_y), x = train_images)
rf <- ranger::ranger(y ~ ., rf_data, num.trees = 200)

Testing the random forest, we get an error rate of ~2.86% (accuracy of ~97%). Not bad!
rf_y_test <- rep(0, nrow(test_labels))
for (i in 1:nrow(test_labels)) {
rf_y_test[i] <- pvals[test_labels[i,]==1]
}
rf_data_test <- data.frame(y = as.factor(rf_y_test), x = test_images)
pred_rf <- predict(rf, data = rf_data_test[,-1])
err <- mean(pred_rf$predictions != rf_y_test) #for ranger
err * 100